With this data set, I examine Boston crime data. I have placed an emphasis on readability and clarity in the graphics, so I hope that this document is informative about the data set!
library(tidyverse)
library(lubridate)
library(ggplot2)
library(scales)
library(leaflet)
library(dplyr)
library(sp)
library(leaflet.extras)
library(furrr)
library(sf)
library(RColorBrewer)
library(zoo)
library(caTools)
library(caret)
library(randomForest)
library(forecast)
theme_set(theme_minimal(base_family = "Times New Roman") + theme(
text = element_text(family = "Times New Roman")
))
options(dplyr.summarise.inform = FALSE)
data = read_csv("./Boston_Crime_Data.csv")
data$OCCURRED_ON_DATE = mdy_hm(data$OCCURRED_ON_DATE)
data = data %>%
separate(Location, into = c("Latitude", "Longitude"), sep = ",") %>%
mutate(Latitude = as.numeric(gsub("[()]", "", Latitude)),
Longitude = as.numeric(gsub("[()]", "", Longitude))) %>%
select(-Lat, -Long)
traffic_offenses <- c("M/V ACCIDENT - PROPERTY DAMAGE", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "M/V ACCIDENT - PERSONAL INJURY", "M/V ACCIDENT - OTHER", "M/V ACCIDENT - PROPERTY DAMAGE", "M/V - LEAVING SCENE - PROPERTY DAMAGE", "M/V - LEAVING SCENE - PERSONAL INJURY", "M/V ACCIDENT - POLICE VEHICLE", "M/V ACCIDENT INVOLVING PEDESTRIAN - INJURY", "M/V ACCIDENT - INVOLVING PEDESTRIAN - NO INJURY", "M/V ACCIDENT - INVOLVING BICYCLE - NO INJURY", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "VAL - OPERATING UNREG/UNINS CAR", "VAL - OPERATING WITHOUT LICENSE", "VAL - OPERATING AFTER REV/SUSP.", "VAL - VIOLATION OF AUTO LAW - OTHER", "INJURY BICYCLE NO M/V INVOLVED", "OPERATING UNDER THE INFLUENCE ALCOHOL", "OPERATING UNDER THE INFLUENCE DRUGS", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "M/V ACCIDENT - INVOLVING PEDESTRIAN - INJURY", "M/V PLATES - LOST", "VAL - VIOLATION OF AUTO LAW", "OPERATING UNDER THE INFLUENCE (OUI) ALCOHOL", "OPERATING UNDER THE INFLUENCE (OUI) DRUGS", "M/V ACCIDENT - POLICE VEHICLE", "M/V ACCIDENT - INVOLVING PEDESTRIAN - NO INJURY", "M/V ACCIDENT - PERSONAL INJURY","OPERATING UNDER THE INFLUENCE ALCOHOL", "OPERATING UNDER THE INFLUENCE DRUGS", "LICENSE VIOLATION", "LICENSE PLATE RELATED INCIDENTS", "LICENSE PREMISE VIOLATION", "TOWED MOTOR VEHICLE", "VAL - OPERATING UNREG/UNINS CAR", "M/V ACCIDENT - PROPERTY DAMAGE", "AUTO THEFT - RECOVERED IN BY POLICE", "AUTO THEFT LEASE/RENT VEHICLE", "B&E NON-RESIDENCE DAY - FORCIBLE", "B&E NON-RESIDENCE DAY - NO FORCE", "B&E NON-RESIDENCE DAY - NO PROP TAKEN", "B&E NON-RESIDENCE NIGHT - FORCE", "B&E RESIDENCE DAY - NO FORCE", "B&E RESIDENCE DAY - NO PROP TAKEN", "B&E RESIDENCE NIGHT - ATTEMPT FORCE", "BURGLARY - COMMERICAL - NO FORCE", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "M/V ACCIDENT - OTHER CITY VEHICLE", "M/V ACCIDENT - PROPERTY DAMAGE", "POSSESSION OF BURGLARIOUS TOOLS", "RECOVERED - MV RECOVERED IN BOSTON (STOLEN IN BOSTON) MUST BE SUPPLEMENTAL", "RECOVERED STOLEN PLATE", "VAL - OPERATING UNREG/UNINS CAR", "VAL - OPERATING W/O AUTHORIZATION LAWFUL", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY", "M/V ACCIDENT - PROPERTY DAMAGE", "M/V ACCIDENT - INVOLVING BICYCLE - INJURY")
violent_crimes <- c("ASSAULT SIMPLE - BATTERY", "ASSAULT - AGGRAVATED - BATTERY", "ASSAULT - AGGRAVATED", "ASSAULT - SIMPLE", "BURGLARY - RESIDENTIAL - NO FORCE", "BURGLARY - RESIDENTIAL - ATTEMPT", "BURGLARY - RESIDENTIAL - FORCE", "ROBBERY - STREET", "ROBBERY - COMMERCIAL", "ROBBERY - CAR JACKING", "ROBBERY - BANK", "ROBBERY - OTHER", "ROBBERY - HOME INVASION", "ROBBERY - UNARMED - BUSINESS", "ROBBERY - KNIFE - STREET", "ROBBERY - UNARMED - STREET", "ROBBERY - UNARMED - RESIDENCE", "HOME INVASION", "MURDER, NON-NEGLIGIENT MANSLAUGHTER", "KIDNAPPING/CUSTODIAL KIDNAPPING", "KIDNAPPING - ENTICING OR ATTEMPTED", "SEX OFFENSE - RAPE - FORCIBLE", "SEX OFFENSE - RAPE - FONDLING", "SEX OFFENSE - RAPE - OTHER", "ARSON", "HOMICIDE", "HUMAN TRAFFICKING - INVOLUNTARY SERVITUDE", "HUMAN TRAFFICKING - COMMERCIAL SEX ACTS", "KIDNAPPING", "KIDNAPPING/CUSTODIAL KIDNAPPING", "KIDNAPPING - ENTICING OR ATTEMPTED", "THREATS TO DO BODILY HARM", "ASSAULT D/W - KNIFE ON POLICE OFFICER", "ASSAULT & BATTERY", "ASSAULT & BATTERY D/W - OTHER", "A&B ON POLICE OFFICER", "ASSAULT D/W - OTHER", "ASSAULT & BATTERY D/W - KNIFE", "A&B HANDS, FEET, ETC. - MED. ATTENTION REQ.", "ASSAULT & BATTERY D/W - OTHER ON POLICE OFFICER", "ROBBERY - UNARMED - CHAIN STORE", "ROBBERY", "HARASSMENT", "THREATS TO DO BODILY HARM", "CRIMINAL HARASSMENT", "Fondling - Indecent Assault", "SEX OFFENSE - RAPE - FONDLING", "SEX OFFENSE - RAPE - OTHER", "SEX OFFENSE - RAPE - SODOMY", "ABDUCTION - INTICING", "KIDNAPPING", "FIREARM/WEAPON - ACCIDENTAL INJURY / DEATH", "CHILD ABUSE", "KIDNAPPING/CUSTODIAL KIDNAPPING/ ABDUCTION", "MANSLAUGHTER - NON-VEHICLE - NEGLIGENCE", "MANSLAUGHTER - VEHICLE - NEGLIGENCE", "ROBBERY ATTEMPT - KNIFE - BANK", "HARASSMENT/ CRIMINAL HARASSMENT")
drug_firearm_offenses <- c("DRUGS - OTHER", "DRUGS - POSS CLASS B - COCAINE, ETC.", "DRUGS - POSS CLASS A - HEROIN, ETC.", "DRUGS - POSS CLASS B - INTENT TO MFR DIST DISP", "DRUGS - POSS CLASS D", "DRUGS - POSSESSION", "DRUGS - CONSP TO VIOL CONTROLLED SUBSTANCE", "DRUGS - POSS CLASS E", "DRUGS - POSS CLASS E - INTENT TO MFR DIST DISP", "FIREARM/WEAPON - FOUND OR CONFISCATED", "WEAPON - FIREARM - CARRYING / POSSESSING, ETC", "WEAPON - OTHER - OTHER VIOLATION", "BALLISTICS EVIDENCE/FOUND", "WEAPON - OTHER - CARRYING / POSSESSING, ETC", "WEAPON - FIREARM - OTHER VIOLATION", "FIREARM/WEAPON - POSSESSION OF DANGEROUS", "FIREARM/WEAPON - CARRY - SELL - RENT", "WEAPON - FIREARM - SALE / TRAFFICKING", "EXPLOSIVES - TURNED IN OR FOUND", "EXPLOSIVES - POSSESSION OR USE", "FIREARM/WEAPON - FOUND OR CONFISCATED", "FIREARM/WEAPON - LOST", "DRUGS - SALE / MANUFACTURING", "DRUGS - POSS CLASS C", "DRUGS - POSS CLASS A - INTENT TO MFR DIST DISP", "DRUGS - SICK ASSIST - HEROIN", "DRUGS - SICK ASSIST - OTHER NARCOTIC", "DRUGS - SICK ASSIST - OTHER HARMFUL DRUG", "DRUGS - POSS CLASS D - INTENT TO MFR DIST DISP", "DRUGS - POSS CLASS E INTENT TO MF DIST DISP", "DRUGS - POSS CLASS D - MARIJUANA, ETC.", "DRUGS - CLASS D TRAFFICKING OVER 50 GRAMS", "DRUGS - GLUE INHALATION", "DRUGS - POSSESSION/ SALE/ MANUFACTURING/ USE", "FIREARM/WEAPON - ACCIDENTAL INJURY / DEATH", "WEAPON VIOLATION - CARRY/ POSSESSING/ SALE/ TRAFFICKING/ OTHER", "WEAPON - FIREARM - CARRYING / POSSESSION", "WEAPON - FIREARM - SALE / TRAFFICKING", "WEAPON - FIREARM - OTHER VIOLATION", "WEAPON - OTHER - OTHER VIOLATION", "WEAPON - OTHER - CARRYING / POSSESSION", "WEAPON - OTHER - SALE / TRAFFICKING", "DRUGS - CLASS A TRAFFICKING OVER 18 GRAMS", "DRUGS - CLASS B TRAFFICKING OVER 18 GRAMS", "DRUGS - POSS CLASS C - INTENT TO MFR DIST DISP", "DRUGS - POSS CLASS D - INTENT MFR DIST DISP", "DRUGS - POSSESSION OF DRUG PARAPHANALIA", "SICK ASSIST - DRUG RELATED ILLNESS")
property_crimes <- c("PROPERTY - LOST", "LARCENY THEFT FROM MV - NON-ACCESSORY", "VANDALISM", "LARCENY ALL OTHERS", "LARCENY SHOPLIFTING", "LARCENY THEFT FROM BUILDING", "LARCENY THEFT OF BICYCLE", "LARCENY PURSE SNATCH - NO FORCE", "LARCENY PICK-POCKET", "BURGLARY - COMMERCIAL - NO FORCE", "BURGLARY - COMMERCIAL - ATTEMPT", "BURGLARY - COMMERCIAL - FORCE", "BURGLARY - OTHER - NO FORCE", "BURGLARY - OTHER - FORCE", "BURGLARY - OTHER - ATTEMPT", "TRESPASSING", "STOLEN PROPERTY - BUYING / RECEIVING / POSSESSING", "PROPERTY - DAMAGE", "PROPERTY - STOLEN THEN RECOVERED", "AUTO THEFT", "LARCENY THEFT OF MV PARTS & ACCESSORIES", "BURGLARY - COMMERICAL - FORCE", "AUTO THEFT - LEASED/RENTED VEHICLE", "RECOVERED - MV RECOVERED IN BOSTON (STOLEN OUTSIDE BOSTON)", "AUTO THEFT - MOTORCYCLE / SCOOTER", "PROPERTY - MISSING", "LARCENY IN A BUILDING UNDER $50", "LARCENY OTHER UNDER $50", "LARCENY SHOPLIFTING UNDER $50", "LARCENY SHOPLIFTING $50 TO $199", "LARCENY SHOPLIFTING $200 & OVER", "LARCENY BICYCLE $200 & OVER", "LARCENY OTHER $200 & OVER", "LARCENY IN A BUILDING $50 TO $199", "LARCENY OTHER $50 TO $199", "LARCENY IN A BUILDING $200 & OVER", "LARCENY NON-ACCESSORY FROM VEH. $50 TO $199", "LARCENY NON-ACCESSORY FROM VEH. $200 & OVER", "LARCENY NON-ACCESSORY FROM VEH. UNDER $50", "AUTO THEFT OTHER", "PROPERTY - RECEIVING STOLEN", "AUTO THEFT - OUTSIDE - RECOVERED IN BOSTON", "BREAKING AND ENTERING (B&E) MOTOR VEHICLE", "M/V ACCIDENT - PROPERTY DAMAGE", "BURGLARY - RESIDENTIAL", "BURGLARY - COMMERICAL", "BREAKING AND ENTERING (B&E) MOTOR VEHICLE (NO PROPERTY STOLEN)", "PROPERTY - LOST/ MISSING", "BURGLARY - COMMERICAL - ATTEMPT")
public_order_crimes <- c("DISORDERLY CONDUCT", "NOISY PARTY/RADIO-ARREST", "NOISY PARTY/RADIO-NO ARREST", "DISTURBING THE PEACE", "DRINKING IN PUBLIC", "GAMBLING - BETTING / WAGERING", "PROSTITUTION - COMMON NIGHTWALKER", "PROSTITUTION - SOLICITING", "LIQUOR - DRINKING IN PUBLIC", "LIQUOR LAW VIOLATION", "PROSTITUTION", "FUGITIVE FROM JUSTICE", "VIOLATION - CITY ORDINANCE", "VIOLATION - HAWKER AND PEDDLER", "DEMONSTRATIONS/RIOT", "AFFRAY", "DISORDERLY PERSON", "GRAFFITI", "DISTURBING THE PEACE/ DISORDERLY CONDUCT/ GATHERING CAUSING ANNOYANCE/ NOISY PAR", "LIQUOR/ALCOHOL - DRINKING IN PUBLIC", "DRUNKENNESS", "ANNOYING AND ACCOSTING", "OBSCENE MATERIALS - PORNOGRAPHY", "PROSTITUTION - ASSISTING OR PROMOTING", "VERBAL DISPUTE", "ANNOYING AND ACCOSTIN", "GATHERING CAUSING ANNOYANCE", "LANDLORD - TENANT", "OBSCENE PHONE CALLS", "PRISONER ATTEMPT TO RESCUE", "PROSTITUTE - COMMON NIGHTWALKER", "PROTECTIVE CUSTODY / SAFEKEEPING", "TRUANCY / RUNAWAY", "CONTRIBUTING TO DELINQUENCY OF MINOR")
fraud_financial_crimes = c("FRAUD - WIRE", "FRAUD - CREDIT CARD / ATM FRAUD", "FRAUD - IMPERSONATION", "FRAUD - FALSE PRETENSE / SCHEME", "FRAUD - WELFARE", "FRAUD - BANK EMBEZZLEMENT", "FRAUD - EMBEZZLEMENT", "FRAUD - FORGERY / COUNTERFEITING", "FRAUD - ILLEGAL USE CREDIT CARDS", "FRAUD - COMPUTER RELATED", "FRAUD - TELEPHONE SERVICE", "FRAUD - TELEMARKETING", "FRAUD - BUSINESS EMBEZZLEMENT", "FRAUD - OTHER", "EXTORTION", "EMBEZZLEMENT", "EMBEZZLEMENT - RETAIL", "FORGERY / COUNTERFEITING", "FRAUDS - ALL OTHER", "COUNTERFEITING", "LARCENY THEFT FROM COIN-OP MACHINE", "FRAUD - FALSE PRETENSE", "FORGERY OR UTTERING", "CUSTODIAL KIDNAPPING", "EXTORTION OR BLACKMAIL", "PROPERTY - CONCEALING LEASED", "VIOLATION - CITY ORDINANCE CONSTRUCTION PERMIT")
other_crimes <- c("INVESTIGATE PERSON", "INVESTIGATE PROPERTY", "MISSING PERSON", "MISSING PERSON - LOCATED", "LANDLORD - TENANT SERVICE", "ANIMAL INCIDENTS", "ANIMAL CONTROL - DOG BITES - ETC.", "PROPERTY - FOUND", "PROPERTY - LOST THEN FOUND", "PROPERTY - ACCIDENTAL DAMAGE", "HARBOR INCIDENT / VIOLATION", "SEARCH WARRANT", "PRISONER - SUICIDE / SUICIDE ATTEMPT", "PRISONER ESCAPE / ESCAPE & RECAPTURE", "AIRCRAFT INCIDENTS", "BOMB THREAT", "FIRE REPORT/ALARM - FALSE", "FIRE", "FIRE REPORT - HOUSE, BUILDING, ETC.", "FIRE REPORT - CAR, BRUSH, ETC.", "FIRE REPORT - STRUCTURE - NO FIRE", "FIRE REPORT - VEHICLE", "CONSPIRACY EXCEPT DRUG LAW", "WARRANT ARREST", "VIOL. OF RESTRAINING ORDER W NO ARREST", "SUDDEN DEATH", "SERVICE TO OTHER PD INSIDE OF MA.", "DEATH INVESTIGATION", "MISSING PERSON - NOT REPORTED - LOCATED", "CHILD ABANDONMENT (NO ASSAULT)", "SICK/INJURED/MEDICAL - POLICE", "CHILD ENDANGERMENT", "PROPERTY - LOST THEN LOCATED", "INJURY BICYCLE NO M/V INVOLVED", "INVESTIGATION FOR ANOTHER AGENCY", "MISSING PERSON", "MISSING PERSON REPORT", "SEARCH WARRANT", "FIRE REPORT/ALARM - FALSE", "SICK/INJURED/MEDICAL - PERSON", "OTHER OFFENSE", "ANIMAL ABUSE", "CHILD ENDANGERMENT (NO ASSAULT)", "DANGEROUS OR HAZARDOUS CONDITION", "ANIMAL INCIDENTS (DOG BITES, LOST DOG, ETC)", "BIOLOGICAL THREATS", "CHILD ENDANGERMENT/ABANDONMENT (NO ASSAULT)", "CHILD REQUIRING ASSISTANCE (FOMERLY CHINS)", "CHINS", "EVADING FARE", "FIRE REPORT", "INTIMIDATING WITNESS", "REPORT AFFECTING OTHER DEPTS.", "SERVICE TO OTHER AGENCY", "SERVICE TO OTHER PD OUTSIDE OF MA.", "SICK ASSIST", "STALKING", "SUICIDE / SUICIDE ATTEMPT")
crimesort = function(crime) {
if (crime %in% violent_crimes) {
return("Violent Crime")
} else if (crime %in% property_crimes) {
return("Property Crime")
} else if (crime %in% drug_firearm_offenses) {
return("Drug and Firearm Crime")
} else if (crime %in% fraud_financial_crimes) {
return("Fraud and Financial Crime")
} else if (crime %in% public_order_crimes) {
return("Public Order Crime")
} else if (crime %in% traffic_offenses) {
return("Traffic Offense")
} else if (crime %in% other_crimes) {
return("Other")
} else {
return("Unmatched")
}
}
Zip code coordinates come from this website: https://gist.github.com/erichurst/7882666. Zip code population data is from the 2010 census, can be found here: https://data.census.gov/table?q=ZCTA5+02129+Populations+and+People&tid=DECENNIALSF12010.P1. Haversine formula comes from here: https://en.wikipedia.org/wiki/Haversine_formula.
data = data %>%
mutate(crime_category = unlist(lapply(OFFENSE_DESCRIPTION, crimesort)))
zipsGeo = tribble(~zip, ~lat, ~long,
"02108",42.357768, -71.064858,
"02109",42.367032, -71.050493,
"02110",42.361962, -71.047846,
"02111",42.350518, -71.059077,
"02113",42.365331, -71.055233,
"02114",42.363174, -71.068646,
"02115",42.337105, -71.105696,
"02116",42.350579, -71.076397,
"02118",42.337582, -71.070482,
"02119",42.324029, -71.085017,
"02120",42.332090, -71.096545,
"02121",42.306267, -71.085897,
"02122",42.291413, -71.042158,
"02124",42.285805, -71.070571,
"02125",42.315682, -71.055555,
"02126",42.274227, -71.097423,
"02127",42.334992, -71.039093,
"02128",42.361129, -71.006975,
"02129",42.379657, -71.061487,
"02130",42.309174, -71.113835,
"02131",42.284333, -71.126228,
"02132",42.280455, -71.162017,
"02134",42.358016, -71.128608,
"02135",42.349688, -71.153964,
"02136",42.255083, -71.129220,
"02163",42.366168, -71.122850,
"02199",42.347476, -71.082035,
"02203",42.360598, -71.058775,
"02215",42.347635, -71.103082,
"02210",42.347472, -71.039271,
)
plan(multisession)
find_nearest_zip_haversine <- function(lat, long, zipsGeo) {
if (lat %in% c(-1, 0, 1) || long %in% c(-1, 0, 1)) {
return("00")
}
R <- 6371 # This is the radius of the earth in KM
lat_rad <- as.numeric(lat) * pi / 180
long_rad <- as.numeric(long) * pi / 180
zips_lat_rad <- as.numeric(zipsGeo$lat) * pi / 180
zips_long_rad <- as.numeric(zipsGeo$long) * pi / 180
dlat <- zips_lat_rad - lat_rad
dlong <- zips_long_rad - long_rad
a <- sin(dlat / 2)^2 + cos(lat_rad) * cos(zips_lat_rad) * sin(dlong / 2)^2
c <- 2 * atan2(sqrt(a), sqrt(1 - a))
distances <- R * c
nearest_zip <- zipsGeo[which.min(distances), "zip"]
return(as.character(nearest_zip))
}
data$Zip <- unlist(future_map2(data$Latitude, data$Longitude, find_nearest_zip_haversine, zipsGeo = zipsGeo))
file_path = "./zipsPop.csv"
zipsPop = read_csv(file_path)
long_data = zipsPop %>%
gather(key = "Zip", value = "Pop")
long_data$Zip = gsub("ZCTA5 ", "", long_data$Zip)
zip_codes = zipsGeo$zip
filtered_data = long_data %>%
filter(Zip %in% zip_codes)
data <- data %>%
left_join(filtered_data, by = "Zip")
crime_count <- data %>%
group_by(Zip) %>%
summarize(n_crimes = n())
data <- left_join(data, crime_count, by = "Zip") %>%
mutate(crime_rate = n_crimes / as.numeric(Pop))
data[sapply(data, is.infinite)] <- NA
data %>%
select(Zip, Pop) %>%
unique() %>%
select(Pop) %>%
filter(!is.na(Pop))-> d
boston_population = sum(as.numeric(d$Pop))
# Monthly crime count per zip code
zip_monthly_crime_count <- data %>%
filter(Zip != "00" & Zip != "NA") %>%
group_by(Zip, YEAR, MONTH) %>%
summarize(zip_monthly_crime_count = n())
total_monthly_crime_count = data %>%
group_by(YEAR, MONTH) %>%
summarize(total_monthly_crime_count = n())
data_with_monthly_crime <- data %>%
filter(Zip != "00" & Zip != "NA") %>%
left_join(zip_monthly_crime_count, by = c("Zip", "YEAR", "MONTH")) %>%
left_join(total_monthly_crime_count, by = c("YEAR", "MONTH"))
# Monthly Crime Rate
data_with_monthly_crime <- data_with_monthly_crime %>%
mutate(zip_monthly_crime_rate = zip_monthly_crime_count / as.numeric(Pop)) %>%
mutate(total_monthly_crime_rate = total_monthly_crime_count / as.numeric(boston_population))
data_with_monthly_crime[sapply(data_with_monthly_crime, is.infinite)] <- NA
reference_date <- as.Date("2015-06-01")
data_with_monthly_crime <- data_with_monthly_crime %>%
mutate(date = as.Date(OCCURRED_ON_DATE, format = "%Y-%m-%d")) %>%
mutate(months_since = floor(as.numeric(interval(reference_date, date) / months(1)))) %>%
mutate(months_since = months_since)
data_with_monthly_crime$factor_zip = as.factor(data_with_monthly_crime$Zip)
The dataset consists of almost 500,000 police incidents in Boston from June 2015 to March 2020. Each observation has a unique identifying number, the police code, a description of the incident, and information about the time and location of the incident.
str(data)
## tibble [476,655 × 21] (S3: tbl_df/tbl/data.frame)
## $ INCIDENT_NUMBER : chr [1:476655] "I182061268" "I172040657" "I162013546" "I152067251" ...
## $ OFFENSE_CODE : num [1:476655] 3201 2629 3201 1102 2647 ...
## $ OFFENSE_CODE_GROUP : chr [1:476655] "Property Lost" "Harassment" "Property Lost" "Fraud" ...
## $ OFFENSE_DESCRIPTION: chr [1:476655] "PROPERTY - LOST" "HARASSMENT" "PROPERTY - LOST" "FRAUD - FALSE PRETENSE / SCHEME" ...
## $ DISTRICT : chr [1:476655] NA "C11" "B3" "A1" ...
## $ REPORTING_AREA : num [1:476655] NA 397 433 93 359 456 20 20 282 289 ...
## $ SHOOTING : chr [1:476655] NA NA NA NA ...
## $ OCCURRED_ON_DATE : POSIXct[1:476655], format: "2015-06-15 00:00:00" "2015-06-15 00:00:00" ...
## $ YEAR : num [1:476655] 2015 2015 2015 2015 2015 ...
## $ MONTH : num [1:476655] 6 6 6 6 6 6 6 6 6 6 ...
## $ DAY_OF_WEEK : chr [1:476655] "Monday" "Monday" "Monday" "Monday" ...
## $ HOUR : num [1:476655] 0 0 0 0 0 0 0 0 0 0 ...
## $ UCR_PART : chr [1:476655] "Part Three" "Part Two" "Part Three" "Part Two" ...
## $ STREET : chr [1:476655] "BERNARD" "MELBOURNE ST" "NORFOLK ST" "FANEUIL HALL SQ" ...
## $ Latitude : num [1:476655] -1 42.3 42.3 42.4 42.3 ...
## $ Longitude : num [1:476655] -1 -71.1 -71.1 -71.1 -71.1 ...
## $ crime_category : chr [1:476655] "Property Crime" "Violent Crime" "Property Crime" "Fraud and Financial Crime" ...
## $ Zip : chr [1:476655] "00" "02124" "02124" "02203" ...
## $ Pop : chr [1:476655] NA "47783" "47783" "0" ...
## $ n_crimes : int [1:476655] 29111 41457 41457 12936 41457 39812 13142 13142 33535 32821 ...
## $ crime_rate : num [1:476655] NA 0.868 0.868 NA 0.868 ...
I used a 14 day rolling average to smooth the data so that it is more readable.
daily_crime_count <- data %>%
group_by(date = as.Date(OCCURRED_ON_DATE)) %>%
summarize(crime_count = n())
rolling_window <- 14
daily_crime_count <- daily_crime_count %>%
mutate(rolling_avg = rollmean(crime_count, k = rolling_window, fill = NA, align = "right"))
ggplot(daily_crime_count, aes(x = date)) +
geom_line(aes(y = rolling_avg), color = "red", size = 0.7) +
labs(title = "Number of Crimes per Day (Rolling Average)", x = "Date", y = "Crimes Per Day") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 11, family = "Times New Roman"),
axis.text.y = element_text(size = 11, family = "Times New Roman"),
axis.title = element_text(size = 12, family = "Times New Roman"),
plot.title = element_text(size = 14, family = "Times New Roman", hjust = 0.5)) +
scale_x_date(
breaks = seq(min(daily_crime_count$date), max(daily_crime_count$date), by = "6 months"),
labels = function(x) format(x, "%b %Y")
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 13 rows containing missing values (`geom_line()`).
hourly_crime_count <- data %>%
group_by(HOUR) %>%
summarize(crime_count = n())
total_serious_crime_count <- sum(hourly_crime_count$crime_count)
hourly_crime_count <- hourly_crime_count %>%
mutate(relative_frequency = crime_count / total_serious_crime_count * 100)
ggplot(hourly_crime_count, aes(x = HOUR, y = relative_frequency)) +
geom_bar(stat = "identity", fill = "#006666") +
labs(title = "Crime Frequency by Time of Day", x = "Hour of Day", y = "Relative Frequency (%)") +
theme_minimal() +
theme(text = element_text(family = "Times New Roman"),
plot.title = element_text(hjust = 0.5))
#Crimes by day of week
daily_crime_count <- data %>%
group_by(DAY_OF_WEEK) %>%
summarize(crime_count = n())
daily_crime_count$DAY_OF_WEEK <- factor(daily_crime_count$DAY_OF_WEEK, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
total_crime_count <- sum(daily_crime_count$crime_count)
daily_crime_count <- daily_crime_count %>%
mutate(relative_frequency = crime_count / total_crime_count * 100)
ggplot(daily_crime_count, aes(x = DAY_OF_WEEK, y = relative_frequency)) +
geom_bar(stat = "identity", fill = "#006666") +
labs(title = "Crimes by Day of Week", x = "Day of Week", y = "Relative Frequency (%)") +
theme_minimal() +
theme(text = element_text(family = "Times New Roman"), plot.title = element_text(hjust = 0.5))
monthly_crime_count <- data %>%
group_by(MONTH) %>%
summarize(crime_count = n())
total_crime_count <- sum(monthly_crime_count$crime_count)
monthly_crime_count <- monthly_crime_count %>%
mutate(relative_frequency = crime_count / total_crime_count)
monthly_crime_count$MONTH_NAME <- factor(month.abb[monthly_crime_count$MONTH],
levels = month.abb)
ggplot(monthly_crime_count, aes(x = MONTH_NAME, y = relative_frequency)) +
geom_bar(stat = "identity", fill = "#006666") +
labs(title = "Crimes by Month", x = "Month", y = "Relative Frequency (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(family = "Times New Roman")) +
scale_y_continuous(labels = scales::percent_format())
category_counts <- data %>%
group_by(crime_category) %>%
summarize(crime_count = n())
total_crime_count <- sum(category_counts$crime_count)
category_counts <- category_counts %>%
mutate(relative_frequency = (crime_count / total_crime_count) * 100)
ggplot(category_counts, aes(x = reorder(crime_category, -relative_frequency), y = relative_frequency)) +
geom_bar(stat = "identity", fill = "#006666") +
labs(title = "Crime Distributions Based on Categories", x = "Crime Category", y = "Relative Frequency (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
text = element_text(family = "Times New Roman"))
daily_crime_count <- data %>%
filter(crime_category == "Violent Crime" | crime_category == "Fraud and Financial Crime" | crime_category == "Property Crime" | crime_category == "Drug and Firearm Crime") %>%
group_by(date = as.Date(OCCURRED_ON_DATE)) %>%
summarize(crime_count = n())
rolling_window <- 14
daily_crime_count <- daily_crime_count %>%
mutate(rolling_avg = rollmean(crime_count, k = rolling_window, fill = NA, align = "right"))
ggplot(daily_crime_count, aes(x = date)) +
geom_line(aes(y = rolling_avg), color = "red", size = 0.7) +
labs(title = "Number of Property/Violent Crimes per Day (Rolling Average)", x = "Date", y = "Crimes Per Day") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 11, family = "Times New Roman"),
axis.text.y = element_text(size = 11, family = "Times New Roman"),
axis.title = element_text(size = 12, family = "Times New Roman"),
plot.title = element_text(size = 14, family = "Times New Roman", hjust = 0.5)) +
scale_x_date(
breaks = seq(min(daily_crime_count$date), max(daily_crime_count$date), by = "6 months"),
labels = function(x) format(x, "%b %Y")
)
## Warning: Removed 13 rows containing missing values (`geom_line()`).
daily_crime_count <- data %>%
filter(crime_category == "Traffic Offense" | crime_category == "Public Order Crime") %>%
group_by(date = as.Date(OCCURRED_ON_DATE)) %>%
summarize(crime_count = n())
rolling_window <- 14
daily_crime_count <- daily_crime_count %>%
mutate(rolling_avg = rollmean(crime_count, k = rolling_window, fill = NA, align = "right"))
ggplot(daily_crime_count, aes(x = date)) +
geom_line(aes(y = rolling_avg), color = "red", size = 0.7) +
labs(title = "Number of Petty Crimes per Day (Rolling Average)", x = "Date", y = "Crimes Per Day") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 11, family = "Times New Roman"),
axis.text.y = element_text(size = 11, family = "Times New Roman"),
axis.title = element_text(size = 12, family = "Times New Roman"),
plot.title = element_text(size = 14, family = "Times New Roman", hjust = 0.5)) +
scale_x_date(
breaks = seq(min(daily_crime_count$date), max(daily_crime_count$date), by = "6 months"),
labels = function(x) format(x, "%b %Y")
)
## Warning: Removed 13 rows containing missing values (`geom_line()`).
daily_crime_count <- data %>%
group_by(DAY_OF_WEEK, crime_category) %>%
summarize(crime_count = n())
daily_crime_count$DAY_OF_WEEK <- factor(daily_crime_count$DAY_OF_WEEK, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
total_crime_count <- sum(daily_crime_count$crime_count)
daily_crime_count <- daily_crime_count %>%
mutate(relative_frequency = crime_count / total_crime_count * 100) %>%
filter(crime_category == "Violent Crime" | crime_category == "Fraud and Financial Crime" | crime_category == "Property Crime" | crime_category == "Drug and Firearm Crime" | crime_category == "Traffic Offense" | crime_category == "Public Order Crime")
ggplot(daily_crime_count, aes(x = DAY_OF_WEEK, y = relative_frequency, fill = crime_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Crimes by Day of Week", x = "Day of Week", y = "Relative Frequency (%)") +
theme_minimal() +
theme(text = element_text(family = "Times New Roman"), plot.title = element_text(hjust = 0.5))
hourly_crime_count <- data %>%
group_by(HOUR, crime_category) %>%
summarize(crime_count = n())
total_serious_crime_count <- sum(hourly_crime_count$crime_count)
hourly_crime_count <- hourly_crime_count %>%
mutate(relative_frequency = crime_count / total_serious_crime_count * 100) %>%
filter(crime_category == "Violent Crime" | crime_category == "Fraud and Financial Crime" | crime_category == "Property Crime" | crime_category == "Drug and Firearm Crime" | crime_category == "Traffic Offense" | crime_category == "Public Order Crime")
ggplot(hourly_crime_count, aes(x = HOUR, y = relative_frequency, fill = crime_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Crime Frequency by Time of Day", x = "Hour of Day", y = "Relative Frequency (%)") +
theme_minimal() +
theme(text = element_text(family = "Times New Roman"),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "bottom",
legend.box.margin = ggplot2::margin(l = -50),
legend.text = element_text(size = 9.5, family = "Times New Roman"))
monthly_crime_count <- data %>%
group_by(MONTH, crime_category) %>%
summarize(crime_count = n())
total_crime_count <- sum(monthly_crime_count$crime_count)
monthly_crime_count <- monthly_crime_count %>%
mutate(relative_frequency = crime_count / total_crime_count) %>%
filter(crime_category == "Violent Crime" | crime_category == "Fraud and Financial Crime" | crime_category == "Property Crime" | crime_category == "Drug and Firearm Crime" | crime_category == "Traffic Offense" | crime_category == "Public Order Crime")
monthly_crime_count$MONTH_NAME <- factor(month.abb[monthly_crime_count$MONTH],
levels = month.abb)
ggplot(monthly_crime_count, aes(x = MONTH_NAME, y = relative_frequency, fill = crime_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Crimes by Month", x = "Month", y = "Relative Frequency (%)") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5), text = element_text(family = "Times New Roman"))
## Crime By Location
violent_crime_data <- data %>%
filter(crime_category == "Violent Crime" | crime_category == "Property Crime" | crime_category == "Drug and Firearm Crime" | crime_category == "Traffic Offense" | crime_category == "Fraud and Financial Crime" | crime_category == "Public Order Crime") %>%
filter(Latitude != 0 & Latitude != -1)
aggregate <- violent_crime_data %>%
group_by(Latitude, Longitude) %>%
summarise(intensity = n()) %>%
ungroup()
leaflet(data = aggregate) %>%
addTiles() %>%
addHeatmap(
lat = ~Latitude,
lng = ~Longitude,
intensity = ~intensity,
blur = 55,
radius =20,
gradient = colorRampPalette(c("blue", "green", "yellow", "red", "black"))(1000)
)
Crime rates were calculated by total number of (monthly, in this case) crimes over the number of people in that zip code.
A map of Boston zipcodes can be found here: https://www.cityofboston.gov/images_documents/ZipCodes_tcm3-47884.pdf
ggplot(data_with_monthly_crime, aes(x = date, y = zip_monthly_crime_rate, color = Zip)) +
geom_line(size = 0.7) +
labs(title = "Crime Rate By Zip Code", x = "Date", y = "Crimes Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 11, family = "Times New Roman"),
axis.text.y = element_text(size = 11, family = "Times New Roman"),
axis.title = element_text(size = 12, family = "Times New Roman"),
plot.title = element_text(size = 14, family = "Times New Roman", hjust = 0.5))
## Warning: Removed 12936 rows containing missing values (`geom_line()`).
data_with_monthly_crime %>%
filter(Zip == "02109" | Zip == "02108" | Zip == "02111") %>%
ggplot(aes(x = date, y = zip_monthly_crime_rate, color = Zip)) +
geom_line(size = 0.7) +
labs(title = "Crime Rate By Zip Code", x = "Date", y = "Crimes Rate") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 11, family = "Times New Roman"),
axis.text.y = element_text(size = 11, family = "Times New Roman"),
axis.title = element_text(size = 12, family = "Times New Roman"),
plot.title = element_text(size = 14, family = "Times New Roman", hjust = 0.5))
Crimes by zipcode
The shapefile comes from the local Boston government, and can be found here: https://data.boston.gov/dataset/zip-codes
crime_count <- data %>%
filter(crime_category == "Drug and Firearm Crime" | crime_category == "Property Crime" | crime_category == "Violent Crime" | crime_category == "Traffic Offense" | crime_category == "Fraud and Financial Crime" | crime_category == "Public Order Crime") %>%
group_by(Zip) %>%
summarize(n_crimes = n())
filtered_pop_data <- filtered_data %>%
filter(Pop != 0 & !is.na(Pop))
crime_rate_data <- left_join(crime_count, filtered_pop_data, by = "Zip") %>%
mutate(crime_rate = n_crimes / as.numeric(Pop))
boston_zipcodes <- st_read("zips.shp")
## Reading layer `zips' from data source
## `/Users/tommyschupp/Desktop/Programming/Jobs/BASE/BASETommySchupp/zips.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 43 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 739797.7 ymin: 2908274 xmax: 803996.3 ymax: 2971831
## Projected CRS: NAD83 / Massachusetts Mainland (ftUS)
boston_zipcodes <- st_transform(boston_zipcodes, 4326)
crime_rate_data = crime_rate_data %>%
rename(ZIP5 = Zip)
boston_zipcodes_crime <- left_join(boston_zipcodes, crime_rate_data, by = "ZIP5") %>%
filter(ZIP5 != "02199")
color_palette <- colorRampPalette(c("white", "darkred"))
pal <- colorNumeric(palette = color_palette(10), domain = boston_zipcodes_crime$crime_rate)
outlier_color <- "blue"
leaflet(boston_zipcodes_crime) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(
fillColor = pal(boston_zipcodes_crime$crime_rate),
weight = 1,
opacity = 1,
color = "white",
fillOpacity = 0.7,
highlight = highlightOptions(
weight = 3,
color = "#666",
fillOpacity = 0.7,
bringToFront = TRUE
),
label = ~paste0("Zip: ", ZIP5, "Crime Rate: ", round(crime_rate, 6))
) %>%
addLegend(
"bottomright",
pal = pal,
values = ~crime_rate,
title = "Crime Rate",
opacity = 1
)
I used the ARIMA timeseries forecasting model.
data_with_monthly_crime %>%
select(months_since, total_monthly_crime_rate) %>%
unique() -> d
inputVector = d$total_monthly_crime_rate
# Create a time series object
ts_data <- ts(inputVector, start = c(2015, 6), frequency = 12)
# Split the data into a training set and a testing set
train_data <- window(ts_data, end = c(2019, 9))
test_data <- window(ts_data, start = c(2019, 9), end = c(2020, 2))
# Identify the optimal parameters for the ARIMA model
optimal_model <- auto.arima(train_data)
# Forecast future crime rates
forecasts <- forecast(optimal_model, h = length(test_data))
# Evaluate the model's performance
accuracy(forecasts, test_data)
## ME RMSE MAE MPE MAPE
## Training set 1.484250e-04 0.0010570684 0.0006592465 0.32327218 5.439665
## Test set 1.302099e-05 0.0001347637 0.0001066426 0.06304021 0.804712
## MASE ACF1 Theil's U
## Training set 1.1644057 -0.2907182 NA
## Test set 0.1883593 -0.1261956 0.1286343
# Plot the forecasts
autoplot(forecasts) + autolayer(test_data, series = "Actual") + xlab("Year") + ylab("Monthly Crime Rate") +
ggtitle("ARIMA Model: Forecasted vs Actual Monthly Crime Rates")
data_with_monthly_crime %>%
select(months_since, total_monthly_crime_rate) %>%
unique() -> d
inputVector = d$total_monthly_crime_rate
# Create a time series object
ts_data <- ts(inputVector, start = c(2015, 6), frequency = 12)
# Split the data into a training set and a testing set
train_data <- window(ts_data, end = c(2019, 9))
test_data <- window(ts_data, start = c(2019, 9), end = c(2020, 4))
# Identify the optimal parameters for the ARIMA model
optimal_model <- auto.arima(train_data)
# Forecast future crime rates
forecasts <- forecast(optimal_model, h = length(test_data))
# Evaluate the model's performance
accuracy(forecasts, test_data)
## ME RMSE MAE MPE MAPE
## Training set 0.000148425 0.001057068 0.0006592465 0.3232722 5.439665
## Test set -0.001424698 0.003315008 0.0015101721 -31.6635536 32.283377
## MASE ACF1 Theil's U
## Training set 1.164406 -0.2907182 NA
## Test set 2.667368 0.1239747 1.159942
# Plot the forecasts
autoplot(forecasts) + autolayer(test_data, series = "Actual") + xlab("Year") + ylab("Monthly Crime Rate") +
ggtitle("ARIMA Model: Forecasted vs Actual Monthly Crime Rates")
I attempted to use ARIMA to look at crime category or zip code specific data, but the datasets were not large enough.
data_with_monthly_crime %>%
filter(Zip == "02109" | Zip == "02108" | Zip == "02111") %>%
select(months_since, zip_monthly_crime_rate) %>%
unique() -> d
inputVector = d$zip_monthly_crime_rate
# Create a time series object
ts_data <- ts(inputVector, start = c(2015, 6), frequency = 12)
# Split the data into a training set and a testing set
train_data <- window(ts_data, end = c(2019, 9))
test_data <- window(ts_data, start = c(2019, 9), end = c(2020, 4))
# Identify the optimal parameters for the ARIMA model
optimal_model <- auto.arima(train_data)
# Forecast future crime rates
forecasts <- forecast(optimal_model, h = length(test_data))
# Evaluate the model's performance
accuracy(forecasts, test_data)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.799776e-17 0.011408870 0.008774141 -6.467744 19.84483 0.7358916
## Test set -2.744766e-03 0.008568353 0.008058406 -8.192793 16.78555 0.6758626
## ACF1 Theil's U
## Training set -0.01404801 NA
## Test set 0.07763858 0.895647
# Plot the forecasts
autoplot(forecasts) + autolayer(test_data, series = "Actual") + xlab("Year") + ylab("Monthly Crime Rate") +
ggtitle("ARIMA Model: Forecasted vs Actual Monthly Crime Rates")
We saw a number of trends about crime in Boston over this time period. Excluding COVID, crime does not seem to be getting significantly better in Boston over this time period. We saw that there were specific peaks in hour, day of the week, and month where more crimes happen. We saw that different zip codes have vastly different numbers of crimes committed/crime rates, and also that crime rates are very different from number of crimes. Finally, we saw that we can predict crime patterns based on time with quite a bit of accuracy.
There were a few issues with the analyses that I conducted that are important to point out. First, the ZIP code population data is from 2010. This might not line up with the 2015-2020 population of Boston. Also, my way of categorizing the code was quite arbitrary, and there might be better ways, especially by looking at the legal classifications of the crime types. Also, the COVID-19 Pandemic had a huge impact on crime rates, and only the first month of that time is considered. However, that month could have an impact on the models and data that might not reflect normal circumstances.
This project includes many opportunities for future development. Socioeconomic actors like median household income in each zipcode could be taken into consideration. Similarly, unemployment, weather, or political data and events could be taken into consideration. We could also examine the data in the recovery from COVID, and see if crime patterns have changed. Finally, we could use other models and machine learning algorithms to make better predictions of the data.